Separating the articles of authors with the same name 



Jose M. Soler* 

Departamento de Fisica de la Materia Condensada, C-III, 

Universidad Autonoma de Madrid, E-28049 Madrid, Spain
(Dated: February 1, 2008) 

I describe a method to separate the articles of different authors with the same name. It is based 
on a distance between any two publications, defined in terms of the probability that they would have 
as many coincidences if they were drawn at random from all published documents. Articles with 
a given author name are then clustered according to their distance, so that all articles in a cluster 
very likely belong to the same author. The method has proven very useful in generating groups
of papers that are then selected manually. This considerably simplifies citation analysis when the
author publication lists are not available. 



Citation analysis has become an essential tool for research
evaluation. Generally, the evaluation referees are provided with a
list of publications of the individuals or groups to be evaluated,
although frequently these are in a format (say, on paper) that is
not easy to use for searches in citation databases. Furthermore,
the widespread accessibility of these databases to the full research
community has stimulated less formal evaluations, in which
publication lists are not available. In such cases, the publication
lists themselves must be generated from the databases, complementing
the author names with their affiliations and research fields. When
even these are not well known (say, because only the last
affiliation and research field are known) the search must be based
on the author name only. This poses the problem of extracting the
articles of the desired author from among those of other authors
with the same name.

In this work I address this problem by defining a distance
between any two given articles, based on the coincidences between
them. This allows related articles to be clustered, so that all the
articles of a cluster are likely to belong to the same author. This
reduces the problem to that of selecting the appropriate clusters,
rather than each individual article.

Distances between documents have been proposed on the basis of
coincidences of words and phrases as well as n-grams (sequences of
n consecutive characters), and these distances have been used for a
wide range of tasks, like language classification, or collecting
documents on a given subject. In the present case, we are interested
in relating documents whose full text is usually not available,
while their abstract is generally available but relatively expensive
to handle in terms of database access and storage. Instead,
documents are characterized by a record with a variety of fields,
like author names and addresses, title, research field, keywords,
journal and year of publication, etc. Since coincidences in all
these fields are significant for identifying their authors, the
problem arises of how to combine them in a consistent way. Thus,
one needs to answer questions like: are two papers 'closer' if they
were published in the same journal or if they have n common words
in their titles? Or if they have a common coauthor?

To solve this problem, I propose the following general idea:
imagine that you draw two documents at random from the entire
database of $N_D$ documents. The probability that they coincide in
everything (that is, that the same document is drawn twice) is
obviously $1/N_D$. The probability that they coincide in any given
feature is also well defined in principle. For example, if $n_j$ of
the documents in the database were published in a given journal $j$,
the probability that the two random articles were both published in
that journal is $(n_j/N_D)^2$. The probability that the two random
articles had a journal-of-publication coincidence equally or less
likely than that is $\sum_{i=j}^{N_J} (n_i/N_D)^2$, with the $N_J$
journals ordered by decreasing number of articles in the database.
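As a rough illustration of the journal example above, the following Python sketch evaluates that sum for a toy database. The function name and the article counts are invented for illustration, and the sketch assumes that every document belongs to exactly one journal, so that the number of documents is the sum of the counts.

```python
def journal_tail_probability(counts, journal):
    """Probability that two random documents share a journal that is
    equally or less populated than `journal`.

    `counts` maps a journal name to its number of articles n_j. The
    total number of documents N_D is taken as the sum of all counts,
    which is an assumption of this sketch.
    """
    n_docs = sum(counts.values())
    n_j = counts[journal]
    # Sum (n_i / N_D)^2 over the journals with n_i <= n_j, that is,
    # over coincidences equally or less likely than the given one.
    return sum((n / n_docs) ** 2 for n in counts.values() if n <= n_j)


# Invented toy database with three journals.
toy_counts = {"A": 50, "B": 30, "C": 20}
print(journal_tail_probability(toy_counts, "A"))  # all journals contribute
print(journal_tail_probability(toy_counts, "C"))  # only the rarest journal
```

Selecting the terms by count rather than by rank treats journals with equal counts as equally likely coincidences, which is one possible reading of the ordering described in the text.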

Then I will define the distance $D_{ij}$ between two documents
$i$ and $j$ by

$$D_{ij} = \log_{10}(P_{ij}) - \log_{10}(1/N_D) \qquad (1)$$

where $P_{ij}$ is the probability that two random documents would
have overall coincidences equally or less likely than those between
$i$ and $j$. Clearly, $i = j \Rightarrow P_{ij} = 1/N_D$ and
$D_{ij} = 0$. At the other extreme, if $i$ and $j$ do not coincide
in anything, then $P_{ij} = 1$ and $D_{ij} = \log_{10}(N_D)$ will be
maximum.

Obviously, Py is highly nontrivial to calculate, espe- 
cially for multiple, correlated coincidences. However, it 
turns out that very crude approximations still lead to 
meaningful distances that are useful for our purposes. 
Therefore, as a first approach, I will make two extremely 
crude approximations: 1) assume that all possible values 
of a given field (say author names, like R. Smith and J. 
M. S. Torroja) are equally probable; and 2) ignore any 
correlations between different coincidences (like address 
words Harvard and Massachusetts). I will divide each
field into 'words', and allow only one instance of each word
within the field (that is, if the word Spain appears twice 
in the list of author addresses, I will take it only once). 
Some words, like articles and prepositions of the title, will 
be excluded. Thus, each field will be characterized by an 
estimated number of possible word values occurring in it.

Field               log10(Size)
Documents (N_D)     8.0
Author names
Email
Title words
Research field
Journal             2.0
Publication year    1.0

TABLE I: Assumed number of possible values taken by the
different fields that characterize a document record from the
ISI-Thomson Web of Knowledge database [2].

For example, if the estimated number of journals is $N_J$, the
approximate probability that two random articles were published in
the same journal is $1/N_J$. More generally, if the estimated number
of possible word values in a field is $N$, and there are $n_i$ and
$n_j$ different words in that field of articles $i$ and $j$, the
probability that exactly $n_{ij}$ of them coincide (in any order) is

$$p(n_{ij}|n_i,n_j,N) = \frac{n_i!\, n_j!\, (N-n_i)!\, (N-n_j)!}{N!\, n_{ij}!\, (n_i-n_{ij})!\, (n_j-n_{ij})!\, (N-n_i-n_j+n_{ij})!} \qquad (2)$$

which is the probability of getting $n_{ij}$ common balls from two
independent random extractions of $n_i$ and $n_j$ balls out of a set
of $N$ different balls. The probability of getting at least $n_{ij}$
coincidences is simply
$P(n_{ij}|n_i,n_j,N) = 1 - \sum_{n=0}^{n_{ij}-1} p(n|n_i,n_j,N)$.
Then, ignoring also correlations between different fields, I will
approximate the distance between $i$ and $j$ by

$$D_{ij} \simeq \log_{10}(N_D) + \sum_{f=1}^{N_F} \log_{10}\left(P(n_{ij}^f|n_i^f,n_j^f,N^f)\right) \qquad (3)$$

where $f$ indexes the $N_F$ different record fields.
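Equations (2) and (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's filter program: the function names, the record layout (a dict mapping a field name to a list of words), and the sample records are hypothetical. Only the journal and publication-year sizes are taken from Table I, and each field is assumed to hold no more distinct words than its assumed size $N$.

```python
from math import comb, log10


def p_exact(k, n_i, n_j, n_values):
    """Eq. (2): probability that exactly k words coincide."""
    return comb(n_i, k) * comb(n_values - n_i, n_j - k) / comb(n_values, n_j)


def p_at_least(k, n_i, n_j, n_values):
    """Probability of at least k coincidences.

    The upper tail is summed directly. This equals one minus the lower
    tail used in the text, but avoids cancellation when the
    probability is very small.
    """
    return sum(p_exact(n, n_i, n_j, n_values)
               for n in range(k, min(n_i, n_j) + 1))


def distance(rec_a, rec_b, field_sizes, n_docs):
    """Eq. (3): approximate distance between two records."""
    d = log10(n_docs)
    for field, n_values in field_sizes.items():
        # Each word counts only once per field, as in the text.
        words_a = set(rec_a.get(field, ()))
        words_b = set(rec_b.get(field, ()))
        common = len(words_a & words_b)
        d += log10(p_at_least(common, len(words_a), len(words_b), n_values))
    return d


# Sizes for the two fields whose values are given in Table I.
field_sizes = {"journal": 10**2, "year": 10**1}
n_docs = 10**8

# Hypothetical records, for illustration only.
paper_1 = {"journal": ["PHYS REV B"], "year": ["1997"]}
paper_2 = {"journal": ["PHYS REV B"], "year": ["1997"]}
paper_3 = {"journal": ["NATURE"], "year": ["2001"]}

print(distance(paper_1, paper_2, field_sizes, n_docs))  # same journal and year
print(distance(paper_1, paper_3, field_sizes, n_docs))  # no coincidences
```

With only these two fields, a shared journal and year lower the distance from its maximum $\log_{10}(N_D) = 8$ by $2 + 1 = 3$ units. Coincidences in further fields would lower it more.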

Table I shows the estimated number of possible values for the
fields provided by the standard records of the ISI-Thomson Web of
Knowledge [2] (excluding 'abstract' and 'cited references'). Notice
that most of the assumed values are much lower than the true number
of possible options. Rather, they are set so that $1/N$ is roughly
the probability of the most frequent word in that field (i.e.,
$\sim 10^{-3}$ is the estimated probability of an author name like
R. Smith). Even so, when two articles are 'close' (i.e., when they
belong to the same author), the neglect of correlations implies a
large underestimation of the probability of the combined
coincidences, making $D_{ij}$ negative. The important point,
however, is that, when the two articles do not belong to the same
author, the coincidences are rarely sufficient to make $D_{ij} < 2$,
which is what one would expect for the probability
$P_{ij} \sim 10^2/N_D$ that two random articles belong to the same
author (assuming that the average author has published $\sim 10^2$
articles).



It is not infrequent that an author changes affiliation and,
simultaneously, field of research (for example after finishing the
PhD). Still, it is common that she/he publishes a pending work in
the former field (and perhaps with some of the former coauthors) but
already using the new affiliation. In this case, it is possible to
trace the common author identity in the two groups of apparently
unrelated papers. To allow this, I define a new set of distances as

$$d_{ij} = \min_k (D'_{ik} + D'_{kj}), \quad \text{where } D'_{ij} = \max(D_{ij}, 0) \qquad (4)$$

where $k$ runs over all the papers with the given author name. A
similar redefinition of distances has been proposed for nonlinear
dimensionality reduction, where $k$ was restricted to a small
neighborhood of $i$ and $j$. In the present case, however, distances
are strongly non-Euclidean and multidimensional scaling has not
proven particularly useful.
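A minimal Python sketch of Eq. (4) follows. It applies the definition once, so it allows at most one intermediate paper $k$ between $i$ and $j$. The function name and the toy matrix are invented for illustration.

```python
def relay_distances(D):
    """Eq. (4): d_ij = min_k (D'_ik + D'_kj), with D'_ij = max(D_ij, 0).

    D is a square list of lists holding the distances of Eq. (3) among
    all the papers with the given author name.
    """
    n = len(D)
    # Negative distances are clipped to zero before being combined.
    clipped = [[max(x, 0.0) for x in row] for row in D]
    return [[min(clipped[i][k] + clipped[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]


# Invented example: paper 1 is close to both paper 0 and paper 2,
# which are far from each other.
D = [[0.0, -3.0, 7.0],
     [-3.0, 0.0, -2.0],
     [7.0, -2.0, 0.0]]
d = relay_distances(D)
print(d[0][2])  # paper 1 bridges papers 0 and 2
```

In this toy case the direct distance of 7 between papers 0 and 2 drops to zero, because both are at a non-positive distance from paper 1.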

The problem of classifying or clustering a set of elements
according to their distances is highly nontrivial [6]. In our case,
however, this task is facilitated by the neglect of correlations and
the subsequent underestimation of distances between articles of the
same author, since this creates a large gap between these distances
and those among different authors. In practice, I simply make
clusters of papers that have zero distance (notice that the
definition of $d_{ij}$ implies that all the distances among the
cluster members must be zero). The resulting clusters of papers,
generated with the values of Table I, tend to give some 'false
negatives' (i.e., different clusters that belong to the same author)
but rarely 'false positives' (papers of different authors within the
same cluster), except perhaps for the most common author names (for
these, it may be necessary to increase $N_D$, or to decrease the
other values of Table I, in order to increase the distances).
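One way to form the zero-distance clusters is sketched below in Python. It assumes that clusters are the connected components of the "zero distance" relation, found with a small union-find structure. This is my reading of the procedure, not necessarily how the author's program does it, and the function name and toy matrix are invented.

```python
def zero_distance_clusters(d, tol=0.0):
    """Group papers whose pairwise distance d[i][j] is at most `tol`.

    Returns lists of paper indices, largest cluster first.
    """
    n = len(d)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] <= tol:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)


# Invented distances: papers 0 and 1 are at zero distance.
d_toy = [[0, 0, 5, 9],
         [0, 0, 5, 9],
         [5, 5, 0, 9],
         [9, 9, 9, 0]]
print(zero_distance_clusters(d_toy))
```

Single-paper clusters are kept in the output, since the text notes that many of the resulting clusters contain only one paper.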

The clusters are then presented interactively (by show- 
ing one or more representative papers of the cluster), in 
different possible orders, for their selection or rejection. 
Other clues, like the period of publication of the cluster 
papers, or the distance to previously selected clusters, are 
also provided to help in the selection. Thus, in most cases 
it is very obvious which clusters must be selected, and the 
selection process is very fast and straightforward. Once 
the largest clusters have been considered, it is convenient 
to switch to an order of presentation by increasing distance
to the selected papers and, as soon as this distance
becomes larger than ~ 3, the remaining clusters may be 
rejected altogether. This is important since, most gener- 
ally, the main inconvenience is the large number of small 
clusters (many of them with a single paper) that appar- 
ently belong to different authors. The following shows 
the beginning of the selection dialog for a typical case of
intermediate complexity (for the name of this author,
Soler JM):

Group 1 has 99 papers and 4364 citations in period 1981-2006
Distance to selected groups is ****** A sample paper is
Title: Density-functional method for very large systems with LCAO basis sets
Authors: SanchezPortal, D; Ordejon, P; Artacho, E; Soler, JM;
Source: Int. J. Quantum Chem. (1997) 65, 453:461
Address words: AUTONOMA MADRID FIS MAT CONDENSADA E-28049 SPAIN
               NICOLAS CABRERA OVIEDO E-33007
Select this group? (y | n | u | all | none | p | c | d | (number) | help):



The first group of papers is mine, without any false positives.
In this case there are no false negatives either (i.e., none of the
papers in the other groups are mine), although this is not the most
usual case.
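The ordering by increasing distance and the cutoff of about 3 described for the remaining clusters can be sketched as follows in Python. The text does not say how the distance from a cluster to the selected papers is measured, so this sketch assumes the minimum over all pairs. The function name and the toy data are invented.

```python
def rank_remaining(clusters, selected, d, cutoff=3.0):
    """Order unselected clusters by distance to the selected papers.

    clusters: lists of paper indices not yet selected.
    selected: indices of the papers already selected.
    d:        matrix of distances between papers.
    Returns (kept, rejected), where kept holds (distance, cluster)
    pairs sorted by increasing distance and rejected holds the
    clusters farther away than `cutoff`.
    """
    def dist_to_selected(cluster):
        # Assumption of this sketch: use the closest pair of papers.
        return min(d[i][j] for i in cluster for j in selected)

    ranked = sorted(((dist_to_selected(c), c) for c in clusters),
                    key=lambda pair: pair[0])
    kept = [(dist, c) for dist, c in ranked if dist <= cutoff]
    rejected = [c for dist, c in ranked if dist > cutoff]
    return kept, rejected


# Invented distances: papers 0 and 1 have already been selected.
d_example = [[0.0, 0.0, 2.5, 9.0],
             [0.0, 0.0, 4.0, 8.0],
             [2.5, 4.0, 0.0, 7.0],
             [9.0, 8.0, 7.0, 0.0]]
kept, rejected = rank_remaining([[3], [2]], [0, 1], d_example)
print(kept)      # clusters still worth showing, closest first
print(rejected)  # clusters beyond the cutoff
```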

In summary, a practical algorithm has been presented 
for separating the papers of an author from those of other 
authors with the same name. It semi-automates the sep- 
aration process by creating clusters of papers that most 
likely belong to the same author, thus simplifying greatly 
the generation of an author publication list. 

I would like to acknowledge very useful discussions 
with J. V. Alvarez, R. Garcia, J. Gomez-Herrero, L. Seijo, 
and F. Yndurain. This work has been funded by Spain's
Ministry of Science through grant BFM2003-03372.

APPENDIX: HOW TO GET AND PROCESS AN 
ISI-THOMSON SCI FILE 

In order to find in practice the merit indicators of an 
author, one can follow these steps: 

1. Download the programs filter and merit from this
author's web page and compile them if necessary.

2. Perform a "General search" in the ISI-Thomson
Web of Science database for the author's name.
Appropriate filters may be set already in this step, 
if desired. 



3. Select the records obtained. Usually the easiest way 
is to check "Records from 1 to last_one" and click on 
"ADD TO MARKED LIST" (if you find too many 
articles, you may have to mark and save them by 
parts, say (1-500) -> file1, (501-last_one) -> file2).

4. Click on "MARKED LIST".

5. Check the boxes "Author(s)", "Title", "Source",
"keywords", "addresses", "cited reference count",
"times cited", "source abbrev.", "page count", and
"subject category". Do not check "Abstract" or
"cited references", since this would considerably slow
down the next step.

6. Click on "SAVE TO FILE" and save it in your com- 
puter. 

7. Click on "BACK", then on "DELETE THIS LIST"
and "RETURN", and go to step 2 to make another
search, if desired.

8. Use the filter program to help in selecting the papers
of the desired author. Watch for hidden file extensions,
possibly added by your browser, when giving file names
in this and the next step.

9. Run the merit program to find the merit indicators.